Board Game Geek Rating Prediction Based on Comments.

Author: Md Mintu Miah, ID:1001405116

Executive Summary

Text analysis is becoming important for Revealing imformation from text content. Due to advancement of machine learning algorithm, now the text analysis is very flexible and interesting in Neuro-lingistic field. The purpose of this project was predicting the game rating based on comments left by the users. For the purpose of the analysis, 30,000 unbalanced and 20,000 balanced random sample was taken from orginal board game geek data. To obtain the best accuracy MNB, MNB with N-grams, linear-SVC, Ensemble model for balanced SVC and Unbalanced SVC under joining condition and VotingClassifier condition was conducted. The accuracy for different experiment in case of unbalaced data set was: MNB accuaracy 27% with hyperparameter 1, MNB with N-grams 27% and SVC accuracy was 29%. But there was problem with these models due to unbalaced data used for training the model, it was not able to capture most of the negative rating (<5) and very high positive rating (>8). To overcome this problem a 20,000 balanced sample was created by taking 2000 samples from each rating. Then the SVC model was trained with balanced sample, after that SVC model was used to predict on unbalanced dataset. It was found that balanced training model accuracy was less than unbalced models, but it was able to capture all kinds of rating. SVC Balanced accuracy was 23% and SVC-unbalanced test accuracy was 20%, here we train SVC model only as we got that SVC performs better than MNB and MNB-N-grams in terms of the accuracy of the prediction. This project conducted two types of ensemable models: Voting Ensemable shows accuacy 29% while Joining Ensemble shows 66% accuracy which was outstanding results for this project. SVC balnced and SVC unbalced model was joined in this best model to predict on unbalced data.This study concludes that Ensemble model performs best than any other models with highest accuarcy. The challanging of this project was selecting the sample size and finding the best machine leaning algorithm and impliment them in a proper way.

Introduction

The Board Game Geek (BGG) database is a collection of data and information on traditional board games. The game information was recorded to intend for posterity, historical research, and user-contributed ratings. All the information within the database was meticulously and voluntarily entered on a game-by-game basis by board game user. This information is freely offered through flexible queries and "data mining". BoardGameGeek's ranking charts are ordered using the BGG Rating, which is based on the Average Rating. Game Rating was scaled 1 to 10 to present the sentiment. Understanding the popularity of the game depends on information provided by users, which is very important. For this project, board game reviews was used to predict the rating of the game using Machine learning Algorithm. Three kinds of Machine learning Algorithms (MNB, MNB_N-grams,SVC and Ensemble models) was used here for the whole projects.

Data Description

The orginal Board game data is vast (1GB-1,31,70,073 rows data for review file) and time consuming process to clean, that needs excellent memory of the computer. So to avoid complexity this project has taken 30,000 unbalnced and 20,000 balanced data set for the aanalysis and model development. The file contains

  • GameID
  • Rating
  • comment

Purpose of the Project

The main purpose of the project is predicting the rating of the game based on the given reviews and understanding how the test classification machine learning algorithm works and improving the outputs on existing references. The second purpose is providing good documentation for the whole process

Naive Bayes Classifier

The Naive Bayes classifier is a simple probabilistic classifier which is based on Bayes theorem with strong and naïve independence assumptions. It is one of the most basic text classification techniques with various applications in email spam detection, personal email sorting, document categorization, sexually explicit content detection, language detection and sentiment detection. Despite the naïve design and oversimplified assumptions that this technique uses, Naive Bayes performs well in many complex real-world problems.

Multinomial Naive Bayes

Multinominal Naive Bayes (MNB) algorithm has been widely used in text classification due to its computational advantage and simplicity. MNB maximizes likelihood rather than conditional likelihood or accuracy.The task of text classification can be approached from a Bayesian learning perspective, which assumes that the word distributions in documents are generated by a specific parametric model, and the parameters can be estimated from the training data. Beolow Equation shows Multinominal Naive Bayes (MNB) model which is one such parametric model commonly used in text classification image.png where fi is the number of occurrences of a word wi in a document d, P(wijc) is the conditional probability that a word wi may happen in a document d given the class value c, and n is the number of unique words appearing in the document d.Conditional probability P(wijc) can be determined using the relative frequency of the word wi in documents belonging to class c. image.png where fic is the number of times that a word wi appears in all documents with the class label c, and fc is the total number of words in documents with class label c in T.

One advantage of the Multinominal Naive Bayes model is that it can make predictions efficiently.

Multinomial Naive Bayes with N-grams

An n-gram is defined either as a textual sequence of length n, or similarly, as a sequence of n adjacent ‘textual units’, in both cases extracted from a particular document. A ‘textual unit’ can be identified at a byte, character or word level depending on the context of interest. N-Grams are the basic method for text categorization. It is also a statistical based approach for classifying text. The N is the number of keywords used for dividing the input text. Based on the number of keywords used, the N-gramsare called as 2-grams, 3-grams, etc.

Support Vector Machine- Linear SVC

Linear SVM is the newest extremely fast machine learning algorithm for solving multiclass classification problems from ultra large data sets that implements an original proprietary version of a cutting plane algorithm for designing a linear support vector machine.The objective of a Linear SVC (Support Vector Classifier) is to fit to the data, returning a "best fit" hyperplane that divides, or categorizes the training data.After getting the hyperplane, then the model can be feeded with test sample to classify the "predicted" class.

image.png SVM uses kernel function, which finds the linear hyperplane that separates classes with the maximum margin. The above diagram shows how the data points (that is, support vectors) belonging to two different classes (red versus blue) are separated using the decision boundary based on the maximum margin.

Ensemble Model

Ensemble modeling is a process where multiple diverse models are created to predict an outcome, either by using many different modeling algorithms or using different training data sets. The ensemble model then aggregates the prediction of each base model and results in once final prediction for the unseen data. The motivation for using ensemble models is to reduce the generalization error of the prediction.Every model has its strengths and weaknesses. Ensemble models can be beneficial by combining individual models to help hide the weaknesses of an individual model.

image.png

Voting classification techniqes in the ensemable predicts based on major votes. For example, if we use three models and they predict [1,0,1]target variable, the final prediction that the ensemble model would make would be 1, since two out of the three models predicted 1.

Analysis Steps or Methods

Unbalanced Sample Analysis with MNB, MNB-N-Grams and SVC -

  1. Board Game Geek Data Exploration
  2. Cleaning the data
  3. Taking 30,000 unbalanced random Sample
  4. Understanding the unbalanced sample
  5. Top 50 negative and Positive word identification and word cloud for them
  6. Multinomial Naive Bayes,Multinomial Naive Bayes with N-grams(1,2) and Linear SVC model for unbalanced Sample

Balanced Sample Analysis with Best Model SVC-

  1. Forming 10,000 random Balanced Sample from orginal data with 2000 samples from each rating (1-10)
  2. Traing the best performed SVC model with Balanced Data set
  3. Apply the balanced trained SVC model to predict unbalanced test data

Ensemble model Development with Balanced and unbalanced SVC model

  1. Ensemble model with voting classifier
  2. Ensemble model with joining Balced and Unbalced SVC model

Now lets start with our Board Game Geek Data

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/boardgamegeek-reviews/bgg-13m-reviews.csv
/kaggle/input/boardgamegeek-reviews/games_detailed_info.csv
/kaggle/input/boardgamegeek-reviews/2019-05-02.csv
/kaggle/input/bgg-comments/boardgame-comments-sample.csv
/kaggle/input/board-game-greck-reviews/test_predictions.csv
/kaggle/input/board-game-greck-reviews/reviews_sampled.csv

Import Library

In [2]:
import pandas as pd
import numpy as np
import string
import nltk
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn import svm, linear_model
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.ensemble import VotingClassifier 
sns.set(color_codes=True)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold, cross_val_score, train_test_split
import random
from sklearn.metrics import accuracy_score
from collections import Counter
from sklearn.metrics import accuracy_score

Lets start with importing our orinal data source file

In [3]:
review_data0 = pd.read_csv('../input/boardgamegeek-reviews/bgg-13m-reviews.csv', index_col=0)
review_data0.head()
/opt/conda/lib/python3.7/site-packages/numpy/lib/arraysetops.py:569: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)
Out[3]:
user rating comment ID name
0 sidehacker 10.0 NaN 13 Catan
1 Varthlokkur 10.0 NaN 13 Catan
2 dougthonus 10.0 Currently, this sits on my list as my favorite... 13 Catan
3 cypar7 10.0 I know it says how many plays, but many, many ... 13 Catan
4 ssmooth 10.0 NaN 13 Catan

** we will use .shape to see the number of rows and columns in our data file

In [4]:
review_data0.shape
Out[4]:
(13170073, 5)

Orginal data file has 13,170,073 rows without cleaning and 5 columns. We will Remove all NaN rows from comment columns

So, Remove all NaN rows from comment column

In [5]:
review_data2=review_data0[~review_data0.comment.str.contains("NaN",na=True)]
review_data2.head()
Out[5]:
user rating comment ID name
2 dougthonus 10.0 Currently, this sits on my list as my favorite... 13 Catan
3 cypar7 10.0 I know it says how many plays, but many, many ... 13 Catan
7 hreimer 10.0 i will never tire of this game.. Awesome 13 Catan
11 daredevil 10.0 This is probably the best game I ever played. ... 13 Catan
16 hurkle 10.0 Fantastic game. Got me hooked on games all ove... 13 Catan

We removed all missing comments row, now lets see the shape of the file

In [6]:
review_data2.shape
Out[6]:
(2637755, 5)

Stll our current data table has 26,37,755 rows and 5 columns which is huge. Before taking sample, lets check the description of data and rating bar graph to get the idea of rating frequency in whole data

In [7]:
review_data2.describe()
Out[7]:
rating ID
count 2.637755e+06 2.637755e+06
mean 6.852071e+00 6.693992e+04
std 1.775769e+00 7.304448e+04
min 1.401300e-45 1.000000e+00
25% 6.000000e+00 3.955000e+03
50% 7.000000e+00 3.126000e+04
75% 8.000000e+00 1.296220e+05
max 1.000000e+01 2.724090e+05
In [8]:
#plot histogram of ratings
num_bins = 70
n, bins, patches = plt.hist(review_data2.rating, num_bins, facecolor='green', alpha=0.9)

#plt.xticks(range(9000))
plt.title('Histogram of Ratings')
plt.xlabel('Ratings')
plt.ylabel('Count')
plt.show()

From histogram, it is clear that the highest rating frquency is 7 and than 8,6. The rating is included decimal like 4.2,4.3, 4.4,4.5,5.5,6.5,7.5 and others. But for our analysis we will take the rounded intiger value of rating.

It is clear that orginal data size is too big (13170073 rows) and needs lot of memory to calculate, so we will take subsample data to develop the model. I have taken 30,000 data for developing the model.

Now lets take 30,000 samples (Unbalanced)

In [9]:
review_data2.head()
review_data3=review_data2.sample(n=30000)
review_data3.head()
Out[9]:
user rating comment ID name
9058383 feldfan2014 8.0 Replaced with English version 90040 Pergamon
3135010 jimmyhudson 4.0 Theme isn't that interesting to me so I know s... 100901 Flash Point: Fire Rescue
300 Schwarzie2478 8.0 Everything plays realy smooth, it's good that ... 122842 Exodus: Proxima Centauri
2101261 dipplestix 9.0 This game is excellent (with all three expansi... 2655 Hive
329336 fateswanderer 9.0 Speedy, casual game with a fantastic mechanic ... 36218 Dominion
  • We need only rating and comment columns, but we will keep all as it will not hamper our preocess of analysis
  • Lets check data type with dtypes command
In [10]:
review_data3.dtypes
Out[10]:
user        object
rating     float64
comment     object
ID           int64
name        object
dtype: object
  • our current rating data is float type and comment is object or combination of text, links, numbers and so on.
  • Lets check if we have any missing data yet
In [11]:
review_data3.isna().sum()
Out[11]:
user       0
rating     0
comment    0
ID         0
name       0
dtype: int64
  • we do not have any missing data as we already removed all missing data very begining of the data exploration

Plot histogram of word count

In [12]:
review_data3['word_count']  = review_data3.comment.str.len()

num_bins = 70
n, bins, patches = plt.hist(review_data3.word_count, num_bins, facecolor='green', alpha=0.9)

#plt.xticks(range(9000))
plt.title('Histogram of Word Count')
plt.xlabel('Word Count')
plt.ylabel('Count')
plt.show()
  • The above histrogram shows us the frequency of word in our data set.

Making lowercase, removing punctuation and stop words

In [13]:
#lowercase and remove punctuation
review_data3['cleaned'] = review_data3['comment'].str.lower().apply(lambda x:''.join([i for i in x if i not in string.punctuation]))

# stopword list to use
stopwords_list = stopwords.words('english')
stopwords_list.extend(('game','play','played','players','player','people','really','board','games','one','plays','cards','would')) 

stopwords_list[-10:]

#remove stopwords
review_data3['cleaned'] = review_data3['cleaned'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords_list)]))
review_data3.head()
Out[13]:
user rating comment ID name word_count cleaned
9058383 feldfan2014 8.0 Replaced with English version 90040 Pergamon 29 replaced english version
3135010 jimmyhudson 4.0 Theme isn't that interesting to me so I know s... 100901 Flash Point: Fire Rescue 95 theme isnt interesting know someone else proba...
300 Schwarzie2478 8.0 Everything plays realy smooth, it's good that ... 122842 Exodus: Proxima Centauri 129 everything realy smooth good turn simultanuous...
2101261 dipplestix 9.0 This game is excellent (with all three expansi... 2655 Hive 455 excellent three expansions base lbm makes adva...
329336 fateswanderer 9.0 Speedy, casual game with a fantastic mechanic ... 36218 Dominion 957 speedy casual fantastic mechanic allows massiv...

We have made lower case of all words in comment, removed punctuation and stop words to get the unique, meaningful and clear comments for the analysis. Lower case will halp to get same format of similar kinds of word. stopwords do not carry any meaningful significance.

  • Stop Words: Stop words are basically a set of commonly used words in any language, not just English.The reason why stop words are critical to many applications is that, if we remove the words that are very commonly used in a given language, we can focus on the important words instead. For example, in the context of a search engine, if your search query is “how to develop information retrieval applications”, If the search engine tries to find web pages that contained the terms “how”, “to” “develop”, “information”, ”retrieval”, “applications” the search engine is going to find a lot more pages that contain the terms “how”, “to” than pages that contain information about developing information retrieval applications because the terms “how” and “to” are so commonly used in the English language. If we disregard these two terms, the search engine can actually focus on retrieving pages that contain the keywords: “develop” “information” “retrieval” “applications” – which would bring up pages that are actually of interest.

plot histogram of ratings of our unbalanced sample

In [14]:
num_bins = 70
n, bins, patches = plt.hist(review_data3.rating, num_bins, facecolor='green', alpha=0.9)

#plt.xticks(range(9000))
plt.title('Histogram of Ratings')
plt.xlabel('Ratings')
plt.ylabel('Count')
plt.show()

So, it is clear that our unbalced sample has similar rating pattern like orginal sample.

Now lets see top 50 common words

In [15]:
Counter(" ".join(review_data3["cleaned"]).split()).most_common(50)[:50]
Out[15]:
[('like', 6015),
 ('fun', 5959),
 ('good', 4504),
 ('great', 3750),
 ('much', 3372),
 ('get', 3178),
 ('time', 3141),
 ('card', 2344),
 ('playing', 2308),
 ('rules', 2272),
 ('first', 2257),
 ('little', 2230),
 ('well', 2224),
 ('better', 2144),
 ('lot', 2107),
 ('dont', 2075),
 ('bit', 2008),
 ('still', 1950),
 ('love', 1941),
 ('theme', 1939),
 ('also', 1876),
 ('interesting', 1871),
 ('think', 1815),
 ('nice', 1793),
 ('2', 1757),
 ('best', 1703),
 ('even', 1652),
 ('make', 1646),
 ('easy', 1640),
 ('many', 1617),
 ('simple', 1598),
 ('im', 1553),
 ('two', 1526),
 ('dice', 1491),
 ('long', 1449),
 ('strategy', 1445),
 ('way', 1424),
 ('though', 1410),
 ('enough', 1378),
 ('different', 1375),
 ('luck', 1327),
 ('quite', 1281),
 ('see', 1272),
 ('3', 1270),
 ('pretty', 1240),
 ('rating', 1230),
 ('de', 1216),
 ('could', 1194),
 ('take', 1193),
 ('feel', 1188)]

Like, fun, good, great,much, get,time, rules, well, playing are the top 10 words repeated within the comment. Now, lets define the rating as positive (rating>8) and negative (<3) and display top 100 positive words and negative words that are very useful to predict the rating.

Lets see word clouds for positive words and negative words

In [16]:
from wordcloud import WordCloud
from collections import Counter

neg = review_data3.loc[review_data3['rating'] < 3]
pos = review_data3.loc[review_data3['rating'] > 8]


words = Counter([w for w in " ".join(pos['cleaned']).split()])

wc = WordCloud(width=400, height=350,colormap='plasma',background_color='white').generate_from_frequencies(dict(words.most_common(100)))
plt.figure(figsize=(20,15))
plt.imshow(wc, interpolation='bilinear')
plt.title('Common Words in Positive Reviews', fontsize=20)
plt.axis('off');
plt.show()


words = Counter([w for w in " ".join(neg['cleaned']).split()])

wc = WordCloud(width=400, height=350,colormap='plasma',background_color='white').generate_from_frequencies(dict(words.most_common(100)))
plt.figure(figsize=(20,15))
plt.imshow(wc, interpolation='bilinear')
plt.title('Common Words in Negative Reviews', fontsize=20)
plt.axis('off');
plt.show()
  • A word cloud of positive (rating > 8) and negative (rating < 3) reviews was generated as above. The positive word cloud contains mostly positive 100 words, the negative word cloud contains a mix of 100 words that are not necessarily negative

lets check mean, median and mode of rating of our unbalced sample

In [17]:
print('Mean: ', review_data3.rating.mean())
print('Median: ', review_data3.rating.median())
print('Mode: ', review_data3.rating.mode())
Mean:  6.856034572666648
Median:  7.0
Mode:  0    7.0
dtype: float64

Now define Necessary function to calculate RMSE,Weighted RMSE MAF and model assessment

In [18]:
def calc_rmse(errors, weights=None):
    n_errors = len(errors)
    if weights is None:
        result = sqrt(sum(error ** 2 for error in errors) / n_errors)
    else:
        result = sqrt(sum(weight * error ** 2 for weight, error in zip(weights, errors)) / sum(weights))
    return result

#if the score is far from mean (high or low scores), weight those reviews and ratings more when assessing model accuracy
def calc_weights(scores):
    peak = 6.851
    return tuple((10 ** (0.3556 * (peak - score))) if score < peak else (10 ** (0.2718 * (score - peak))) for score in scores)


def assess_model( model_name, test, predicted):
    error = test - predicted
    rmse = calc_rmse(error)
    mae = mean_absolute_error(test, predicted)
    weights = calc_weights(test)
    weighted_rmse = calc_rmse(error, weights = weights)
    
    
    print(model_name)
    print('RMSE:',rmse)
    print('Weighed RMSE:', weighted_rmse)
    print('MAE:', mae)

Lets do MNB model to predict the rating based on comments

The 30 thousands unbalanced data was splitted as train and test set for the modeling, then pipeline was used to tune the model.

  • count_vectorizer - Breaks up the text into a matrix with each word (called "token" in NLP) being the column of the matrix and the value being the count of occurences.

  • ngram_range - Optional parameter to extract the text in groups of 2 or more words together. This is useful because the modifiers such as 'not' can be used to change the following word's meaning.

  • stopwords - Removes any words from the stopwords list created in the data exploration step.
  • lowercase - Converts all text into lowercase.
  • tfidf_transformer - Weighs terms by importance to help with feature selection.
  • classifier - two types of models was used, multi-class classification, Multinomial NB and LinearSVC

Model performance will be judged with the accuracy value

In [19]:
X_train, X_test, y_train, y_test = train_test_split(review_data3.cleaned, review_data3.rating, random_state=44,test_size=0.20)

model_nb = Pipeline([
    ('count_vectorizer', CountVectorizer(lowercase = True, stop_words = stopwords.words('english'))), 
    ('tfidf_transformer',  TfidfTransformer()), #weighs terms by importance to help with feature selection
    ('classifier', MultinomialNB()) ])
    
model_nb.fit(X_train,y_train.astype('int'))
labels = model_nb.predict(X_test)
mat = confusion_matrix(y_test.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()

assess_model("Multinomial NB", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy is',(acc))
Multinomial NB
RMSE: 1.7390785011429706
Weighed RMSE: 3.8230391348432726
MAE: 1.292360885

****Test accuracy is 26.866666666666667

The accuracy of MNB model is 27 % for unbalanced sample but the problem of the above model is that it did not predict rating 1-4 and 8-9 (confusion Matrix), it has only predicted around the average value of the rating (6.85)

Lets do MNB model with N-grams to predict the rating based on comments

In [20]:
#Experimented with adding different numbers of n-grams, 1-2 seems to have best performance
model_nb2 = Pipeline([
    ('count_vectorizer', CountVectorizer( ngram_range=(1,2), lowercase = True, stop_words = stopwords.words('english'))), 
    ('tfidf_transformer',  TfidfTransformer()), #weighs terms by importance to help with feature selection
    ('classifier', MultinomialNB()) ])
    
model_nb2.fit(X_train,y_train.astype('int'))
labels = model_nb2.predict(X_test)
mat = confusion_matrix(y_test.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()

assess_model("Multinomial NB n-grams 1-2", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy is',(acc))
Multinomial NB n-grams 1-2
RMSE: 1.7543120303424782
Weighed RMSE: 3.871180517324755
MAE: 1.3020942183333333

****Test accuracy is 26.883333333333333

The accuracy of MNB model with N-grams is almost same of MNB (27 %) for unbalanced sample but the problem of the above model is that it also did not predict rating 1-4 and 8-9 (confusion Matrix), it has only predicted around the average value of the rating (6.85).

Lets check with different hyperparameter for MNB Model

In [21]:
# Convert the data in vector fpormate
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
tf_idf_train = tf_idf_vect.fit_transform(X_train)
tf_idf_test = tf_idf_vect.transform(X_test)

alpha_range = list(np.arange(0,30,1))
len(alpha_range)
Out[21]:
30

Take different values of alpha in cross validation and finding the accuracy score

In [22]:
from sklearn.naive_bayes import MultinomialNB
y_train=y_train.astype('int')

alpha_scores=[]

for a in alpha_range:
    clf = MultinomialNB(alpha=a)
    scores = cross_val_score(clf, tf_idf_train, y_train, cv=5, scoring='accuracy')
    alpha_scores.append(scores.mean())
    print(a,scores.mean())
/opt/conda/lib/python3.7/site-packages/sklearn/naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10
  'setting alpha = %.1e' % _ALPHA_MIN)
/opt/conda/lib/python3.7/site-packages/sklearn/naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10
  'setting alpha = %.1e' % _ALPHA_MIN)
/opt/conda/lib/python3.7/site-packages/sklearn/naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10
  'setting alpha = %.1e' % _ALPHA_MIN)
/opt/conda/lib/python3.7/site-packages/sklearn/naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10
  'setting alpha = %.1e' % _ALPHA_MIN)
/opt/conda/lib/python3.7/site-packages/sklearn/naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10
  'setting alpha = %.1e' % _ALPHA_MIN)
0 0.2229166666666667
1 0.2659166666666667
2 0.2637083333333333
3 0.26308333333333334
4 0.26279166666666665
5 0.2628333333333333
6 0.2628749999999999
7 0.26266666666666666
8 0.26266666666666666
9 0.2625833333333333
10 0.26245833333333335
11 0.26241666666666663
12 0.2622083333333333
13 0.2622083333333333
14 0.2622083333333333
15 0.262125
16 0.262125
17 0.26216666666666666
18 0.26216666666666666
19 0.2620416666666666
20 0.262
21 0.262
22 0.2619166666666667
23 0.2619166666666667
24 0.2619166666666667
25 0.2619166666666667
26 0.2619166666666667
27 0.2619166666666667
28 0.2619166666666667
29 0.2619166666666667
In [23]:
# Plot b/w misclassification error and CV mean score.
import matplotlib.pyplot as plt

MSE = [1 - x for x in alpha_scores]


optimal_alpha_bnb = alpha_range[MSE.index(min(MSE))]

# plot misclassification error vs alpha
plt.plot(alpha_range, MSE)

plt.xlabel('hyperparameter alpha')
plt.ylabel('Misclassification Error')
plt.show()
In [24]:
optimal_alpha_bnb
Out[24]:
1
  • It is found that hyper-parameter 1 performs best to predict the rating,our MNB automatically use hyperparameter 1. so already did this model.
In [25]:
model_nb = Pipeline([
    ('count_vectorizer', CountVectorizer(lowercase = True, stop_words = stopwords.words('english'))), 
    ('tfidf_transformer',  TfidfTransformer()), #weighs terms by importance to help with feature selection
    ('classifier', MultinomialNB(alpha=optimal_alpha_bnb)) ])
    
model_nb.fit(X_train,y_train.astype('int'))
labels = model_nb.predict(X_test)
mat = confusion_matrix(y_test.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()

assess_model("Multinomial NB", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy is',(acc))
Multinomial NB
RMSE: 1.7390785011429706
Weighed RMSE: 3.8230391348432726
MAE: 1.292360885

****Test accuracy is 26.866666666666667

We have already done this using MNB model.

Lets try Linear SVC Model

In [26]:
model_svc = make_pipeline(TfidfVectorizer(ngram_range=(1,3)), svm.SVC(kernel="linear",probability=True))
model_svc.fit(X_train, y_train.astype('int'))
labels = model_svc.predict(X_test)

mat = confusion_matrix(y_test.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()

assess_model("Linear SVC model", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy is',(acc))
Linear SVC model
RMSE: 1.6323990116199172
Weighed RMSE: 3.4626254301869106
MAE: 1.190857885

****Test accuracy is 29.95

SVC model performs better than any others model as Accuracy= 29% which is greater than MNB and MNB-N-grams

All the above models are simply predicting reviews around the average rating accept SVC model. MNB and MNB-N-grams did not predict any low reviews. This is because the training data is so unbalanced that it can't detect a negative review.

To overcome this situation. we need to create a balnace data set as the subset of orinal data to train the model, then we need to use that model to predict on the unbalanced sample. In this case, we will train SVC model as it performs better than other two.

Let us create a Balanced sample that have 2000 sample from each rating

In [27]:
review_data2.head()
rating1_subset = review_data2[review_data2['rating']==1] 
rating1_subset.head()

# Slect 100 sample that have rating =1
r1=rating1_subset.sample(2000)
r1.head()


rating2_subset = review_data2[review_data2['rating']==2] 
rating2_subset.head()
# Slect 100 sample that have rating =2
r2=rating2_subset.sample(2000)
r2.head()

rating3_subset = review_data2[review_data2['rating']==3] 
rating3_subset.head()
# Slect 100 sample that have rating =3
r3=rating3_subset.sample(2000)
r3.head()

rating4_subset = review_data2[review_data2['rating']==4] 
rating4_subset.head()
# Slect 100 sample that have rating =4
r4=rating4_subset.sample(2000)
r4.head()

rating5_subset = review_data2[review_data2['rating']==5] 
rating5_subset.head()
# Slect 100 sample that have rating =5
r5=rating5_subset.sample(2000)
r5.head()

rating6_subset = review_data2[review_data2['rating']==6] 
rating6_subset.head()
# Slect 100 sample that have rating =6
r6=rating6_subset.sample(2000)
r6.head()

rating7_subset = review_data2[review_data2['rating']==7] 
rating7_subset.head()
# Slect 100 sample that have rating =7
r7=rating7_subset.sample(2000)
r7.head()

rating8_subset = review_data2[review_data2['rating']==8] 
rating8_subset.head()
# Slect 100 sample that have rating =8
r8=rating8_subset.sample(2000)
r8.head()

rating9_subset = review_data2[review_data2['rating']==9] 
rating9_subset.head()
# Slect 100 sample that have rating=9
r9=rating9_subset.sample(2000)
r9.head()

rating10_subset = review_data2[review_data2['rating']==10] 
rating10_subset.head()
# Slect 100 sample that have rating=10
r10=rating10_subset.sample(2000)
r10.head()
Out[27]:
user rating comment ID name
4274325 Misiodziej 10.0 Bw:5 127023 Kemet
2507106 boulette de steak 10.0 It's my favorite game !!! Without doubt the be... 42 Tigris & Euphrates
8835445 Kejben 10.0 Great two player strategy game with a lot of l... 82421 Summoner Wars: Phoenix Elves vs Tundra Orcs
3430448 Koert 10.0 Excellent epic complex card-driven civilisatio... 25613 Through the Ages: A Story of Civilization
5067212 Hummingbirdmagic 10.0 I have a great interest in history, especially... 171668 The Grizzled

Now combined all 20,000 samples with 2000 samples for each rating-Balanced Sample

In [28]:
review_balance=df = r1.append([r2, r3,r4,r5,r6,r7,r8,r9,r10])
review_balance.head()
Out[28]:
user rating comment ID name
681566 Numskull 1.0 A twelve hour war game disguised as a civiliza... 3870 7 Ages
462726 mgringo 1.0 N/C. not worth it. 16398 War
6491275 newkillerstar27 1.0 Far too long and far too random. 24310 The Red Dragon Inn
9558637 LosSchabossDragon 1.0 The whole theme is distasteful. Ok a lot of em... 65282 Tanto Cuore
2412076 lortelars 1.0 An incredibly boring game with "roll to move" ... 1406 Monopoly
In [29]:
review_balance.shape
Out[29]:
(20000, 5)
  • so, our balanced sample has total 2000 rows from 200 samples for each rating. Lets clean these balance sample once again

Making lowercase, removing punctuation and stop words from Balanced sample

In [30]:
#lowercase and remove punctuation
review_balance['cleaned'] = review_balance['comment'].str.lower().apply(lambda x:''.join([i for i in x if i not in string.punctuation]))

# stopword list to use
stopwords_list = stopwords.words('english')
stopwords_list.extend(('game','play','played','players','player','people','really','board','games','one','plays','cards','would')) 

stopwords_list[-10:]

#remove stopwords
review_balance['cleaned'] = review_balance['cleaned'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords_list)]))
review_balance.head()
Out[30]:
user rating comment ID name cleaned
681566 Numskull 1.0 A twelve hour war game disguised as a civiliza... 3870 7 Ages twelve hour war disguised civilization dominat...
462726 mgringo 1.0 N/C. not worth it. 16398 War nc worth
6491275 newkillerstar27 1.0 Far too long and far too random. 24310 The Red Dragon Inn far long far random
9558637 LosSchabossDragon 1.0 The whole theme is distasteful. Ok a lot of em... 65282 Tanto Cuore whole theme distasteful ok lot employers must ...
2412076 lortelars 1.0 An incredibly boring game with "roll to move" ... 1406 Monopoly incredibly boring roll move bane mechanics com...

Now lets see the balanced rating

In [31]:
#plot histogram of ratings
num_bins = 70
n, bins, patches = plt.hist(review_balance.rating, num_bins, facecolor='green', alpha=0.9)

#plt.xticks(range(9000))
plt.title('Histogram of Ratings')
plt.xlabel('Ratings')
plt.ylabel('Count')
plt.show()

If you see the above bar diagram, then it is clear that all rating have same number of samples which we named balanced sample. Now lets work with this balance sample. First, we will re-train our SVC model with balanced sample, then we will apply this model to predict test set from unbalanced sanple.

Now we are ready to re-train our SVC model with the balance data

In [32]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(review_balance.cleaned, review_balance.rating, test_size=0.20)
model_svc_balance = make_pipeline(TfidfVectorizer(ngram_range=(1,3)), svm.SVC(kernel="linear",probability=True))
model_svc_balance.fit(X_train1, y_train1.astype('int'))
labels = model_svc_balance.predict(X_test1)

mat = confusion_matrix(y_test1.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()

assess_model("Linear SVC Balanced model", y_test1,labels)
acc = accuracy_score(y_test1.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy is',(acc))
Linear SVC Balanced model
RMSE: 2.6094060626893625
Weighed RMSE: 2.912383001169844
MAE: 1.8385

****Test accuracy is 23.65

** Although now it has captured all rating catagory but the accuracy is less than unbalanced sample.

Finally use the retrain SVC model to test the prediction of unbalanced data

In [33]:
X_train, X_test, y_train, y_test = train_test_split(review_data3.cleaned, review_data3.rating, test_size=0.20)
labels = model_svc_balance.predict(X_test)

mat = confusion_matrix(y_test.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()

assess_model("Linear SVC model", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy of re-trained SVC is',(acc))
Linear SVC model
RMSE: 2.3992849035094483
Weighed RMSE: 2.736905640073932
MAE: 1.8000890833333336

****Test accuracy of re-trained SVC is 19.983333333333334

It seems that the accuracy has decreased after using balnced data as the training. Although it has captured all rating class but the error rate is high as we have got accuracy only 29% while it was 30% from SVC model in case of unbalanced data but it was biased results.

Lets try us with Ensemble model to see the performance.

Create ensemble model using VotingClassifier

In [34]:
X_train, X_test, y_train, y_test = train_test_split(review_data3.cleaned, review_data3.rating, test_size=0.20)

Ensemble = VotingClassifier(estimators=[('model_svc_unbalance',model_svc), ('model_svc_balance', model_svc_balance )],
                        voting='soft',
                        weights=[3, 1])

Ensemble.fit(X_train,y_train.astype(int))


labels = Ensemble.predict(X_test)
mat = confusion_matrix(y_test.astype(int), labels)
ax = sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()
assess_model("Ensemble model", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy of Ensemble SVC is',(acc))
Ensemble model
RMSE: 1.665938821715081
Weighed RMSE: 3.013509632025085
MAE: 1.2075596700000002

****Test accuracy of Ensemble SVC is 28.833333333333332

Ensemble model with voting classifier shows only 29% accuracy which is still same with our SVC-unbalanced model (29%).

#join results of SVC models on balanced and unbalanced data to create ensemble model

In [35]:
X_train, X_test, y_train, y_test = train_test_split(review_data3.cleaned, review_data3.rating, test_size=0.20)

labels = model_svc.predict(X_test)
labels_2 = model_svc_balance.predict(X_test)


pred = pd.concat([pd.DataFrame(y_test).reset_index().rating,pd.Series(labels),pd.Series(labels_2)],axis=1)
pred.columns = ['rating','model_1','model_2']

pred = pd.concat([pd.DataFrame(y_test).reset_index().rating,pd.Series(labels),pd.Series(labels_2)],axis=1)
pred.columns = ['rating','model_1','model_2']

pred['final'] = np.where(pred.model_2 >= 3, np.where(pred.model_2 <= 9, pred.model_1, pred.model_2), pred.model_2)
#pred['final'] = np.where(pred.model_2 <= 9, pred.model_1, pred.model_2)
pred.tail()
Out[35]:
rating model_1 model_2 final
5995 7.0 7 4 7
5996 7.0 7 5 7
5997 6.5 6 7 6
5998 7.5 6 7 6
5999 7.0 7 6 7
In [36]:
mat = confusion_matrix(pred.rating.astype(int), pred.final)
ax = sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()
assess_model("Ensemble model", pred.rating,pred.final)

acc = accuracy_score(pred.rating.astype(int),pred.final, normalize=True) * float(100)
print('\n****Test accuracy of Ensemble SVC is',(acc))
Ensemble model
RMSE: 1.7523654870036909
Weighed RMSE: 2.3686473512013473
MAE: 0.8857402483333333

****Test accuracy of Ensemble SVC is 66.95

WOOOW! Finally we have got 66% accuracy which is best than any others model we discussed here. It has also captured all kinds of rating from our data set rather than capturing rating close the mean.

Summary

Models Name and Accuracy:

Multinomial Naïve Bayes(MNB) for Unbalanced dataset 27%

MNB with N-grams for Unbalanced dataset 27%

SVC-Linear for Unbalanced dataset 29%

SVC-Linear Re-trained with Balanced Dataset 25%

SVC-Linear Re-trained on Unbalanced dataset 20%

Ensemble Model with Voting Classifier 29%

Ensemble Model with Balanced and Unbalanced SVC 66%

To obtain the best accuracy MNB, MNB with N-grams, linear-SVC, Ensemble models was conducted. The accuracy for different models in case of unbalaced data set was: MNB accuaracy 27% with hyperparameter 1, MNB with N-grams 27% and SVC accuracy was 29%. But due to unbalaced data, all these models captured prediction around the mean, it was not able to capture most of the negative rating (<5) and very high positive rating (>8). SVC models performs better than MNB and MNB-N-grams and it captured both low and high rating. To overcome this problem a 20,000 balanced sample was created by taking 2000 samples from each rating. Then the SVC model was trained with balanced sample, after that SVC model was used to predict on unbalanced dataset. It was found that balanced training model accuracy was less than unbalced models, but it was able to capture all kinds of rating. SVC Balanced accuracy was 25% and SVC-unbalanced test accuracy was 20%, here we train and test SVC models only as we got previously that SVC performs better than MNB and MNB-N-grams in terms of best accuracy of the prediction. Voting Ensemable shows accuacy 29% while Joining Ensemble shows 66% accuracy which is outstanding results for this project. SVC balnced and SVC unbalced model was joined in this best model to predict the rating on unbalanced dataset. This study concludes that Ensemble model performs best than any other models with highest accuarcy.

Challenges and Improvements

The challanging of this project was handling big data,selecting the sample size and finding the best machine leaning algorithm and impliment them in a proper way to make the model best performed. I tried with different sample sizes, accuracy varied with the sample size. I have taken the sample size that can be handled with kaggle server and minimize run time, at the same time, shows good accuaracy rate. One of the another challanges was finding the good accuarcy from the model. This projects has given overview of performance of different models in terms of accuracy over the existing references.

References